NOTE: this has been superseded by the blog post in

In [210]:
# Standard lib
import re
import pickle
from collections import OrderedDict
from datetime import datetime

# Third-party 
from sqlalchemy import create_engine
import numpy as np
import matplotlib.pyplot as pl
%matplotlib inline
from sklearn.feature_extraction import text
from sklearn.utils.extmath import cartesian
import nltk
from nltk.stem.porter import PorterStemmer
import yaml
import pandas as pd

Path to configuration file with login information to the AAS SQL server

In [130]:
config_filename = "/Users/adrian/projects/aas-abstract-sorter/sql_login.yml"
with open(config_filename) as f:
    config = yaml.safe_load(f)

Establish a database connection

In [135]:
engine = create_engine('mysql+pymysql://{user}:{password}@{server}/{database}'.format(**config))
_presentation_cache = dict()
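
For reference, the connection string above expects four keys in the YAML file: `user`, `password`, `server` and `database`. Here is a minimal sketch of how they get substituted. The values are made-up placeholders, not the real credentials, and `example_config` and `url_template` are names introduced only for this illustration.

```python
# Placeholder values for illustration only; the real ones live in sql_login.yml
example_config = {
    "user": "someuser",
    "password": "secret",
    "server": "db.example.org",
    "database": "meetings",
}

# Same template as the create_engine call above
url_template = "mysql+pymysql://{user}:{password}@{server}/{database}"
example_url = url_template.format(**example_config)
print(example_url)
```

A missing key in the file would raise a `KeyError` at the `.format(**config)` step, before any connection is attempted.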

Get all presentations and sessions from AAS 227

In [195]:
# NOTE: the column list was cut off in the source; presentation.abstract is
# assumed here because the 'abstract' column is used below.
query = """
SELECT session.so_id, presentation.title, presentation.abstract
FROM session, presentation
WHERE session.meeting_code = 'aas227'
  AND session.so_id = presentation.session_so_id
  AND presentation.status IN ('Sessioned', '')
  AND session.type IN (
      'Oral Session'
    , 'Special Session'
    , 'Splinter Meeting'
  );
"""
result = engine.execute(query)
all_results = result.fetchall()
presentation_df = pd.DataFrame(all_results, columns=all_results[0].keys())
# strip HTML tags from the abstracts
presentation_df['abstract'] = presentation_df['abstract'].str.replace('<[^<]+?>', '', regex=True)

In [178]:
query = """
SELECT session.title, session.start_date_time, session.end_date_time, session.so_id
FROM session
WHERE session.meeting_code = 'aas227'
  AND session.type IN (
      'Oral Session'
    , 'Special Session'
    , 'Splinter Meeting'
  )
ORDER BY session.so_id;
"""
result = engine.execute(query)
session_results = result.fetchall()
session_df = pd.DataFrame(session_results, columns=session_results[0].keys())
session_df['start_date_time'] = pd.to_datetime(session_df['start_date_time'])
session_df['end_date_time'] = pd.to_datetime(session_df['end_date_time'])
session_df = session_df[1:] # zero-th entry has a corrupt date

Define a scikit-learn count vectorizer with a custom word tokenizer

In [152]:
# based on
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    # reduce each token to its stem, e.g. 'galaxies' -> 'galaxi'
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems

# any further arguments were cut off in the source; only the tokenizer is known
vectorizer = text.CountVectorizer(tokenizer=tokenize)

Fit the count vectorizer to all AAS abstracts from AAS 227

In [155]:
count_matrix = vectorizer.fit_transform(presentation_df['abstract']).toarray()
print(count_matrix.shape)

(675, 5568)
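
The shape above reads as one row per abstract and one column per distinct stemmed word. As a rough illustration of what the count matrix holds, here is a toy version built with only the standard library. The two "abstracts" are invented and already stemmed by hand, and `docs`, `vocab` and `toy_counts` are names made up for this sketch, not part of the analysis.

```python
from collections import Counter

# Toy "abstracts", already tokenized and stemmed by hand for illustration
docs = [
    ["galaxi", "star", "format"],
    ["star", "mass", "star"],
]

# Vocabulary: sorted unique tokens, one column per token
vocab = sorted({tok for doc in docs for tok in doc})

# Count matrix: one row per document, one column per vocabulary word
toy_counts = [[Counter(doc)[word] for word in vocab] for doc in docs]

print(vocab)
print(toy_counts)
```

Here the toy matrix has 2 rows and 4 columns, and the second row has a 2 in the `star` column because that word appears twice.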

As a quick check, what are the 10 most common words in AAS abstracts?

In [165]:
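ten_most_common_idx = count_matrix.sum(axis=0).argsort()[::-1][:10]
# on scikit-learn >= 1.0 use vectorizer.get_feature_names_out() instead
feature_words = np.array(vectorizer.get_feature_names())
print(feature_words[ten_most_common_idx])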

['galaxi' 'star' 'thi' 'observ' 'mass' 'use' 'model' 'survey' 'format'

For each pair of abstracts, compute the cosine similarity

In [244]:
similiarity_matrix = np.zeros((count_matrix.shape[0],count_matrix.shape[0]))
for ix1 in range(count_matrix.shape[0]):
    for ix2 in range(count_matrix.shape[0]):
        num = count_matrix[ix1].dot(count_matrix[ix2]) 
        denom = np.linalg.norm(count_matrix[ix1]) * np.linalg.norm(count_matrix[ix2])
        if num < 1: # if no common words, the vectors are orthogonal
            v = 0.
        else:
            v = num / denom
        similiarity_matrix[ix1,ix2] = v
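
The double loop is fine for a few hundred abstracts, but the same quantity can be computed in one shot with matrix operations. The following is a sketch of that idea rather than the code used here. `cosine_similarity_matrix` is a hypothetical helper name, and the toy input is invented.

```python
import numpy as np

def cosine_similarity_matrix(counts):
    """Pairwise cosine similarity between the rows of a 2D count array."""
    counts = np.asarray(counts, dtype=float)
    norms = np.linalg.norm(counts, axis=1)   # length of each row vector
    dots = counts @ counts.T                 # all pairwise dot products
    denom = np.outer(norms, norms)           # all pairwise norm products
    # Where denom is zero (an all-zero row), leave the similarity at 0
    sim = np.zeros_like(dots)
    np.divide(dots, denom, out=sim, where=denom > 0)
    return sim

toy = np.array([[1, 1, 0, 1],
                [0, 0, 1, 2],
                [0, 0, 0, 0]])
toy_sim = cosine_similarity_matrix(toy)
print(np.round(toy_sim, 3))
```

Rows with no words in common get a similarity of 0 automatically because their dot product is 0, so the explicit `num < 1` branch is only needed to avoid dividing by zero for empty rows.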

Find the top ten most similar abstracts

In [245]:
similiarity_matrix_1d = np.triu(similiarity_matrix).ravel()
top_ten = sorted(np.unique(similiarity_matrix_1d[~np.isclose(similiarity_matrix_1d,1.)]), reverse=True)[:10]
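
An alternative way to rank pairs is to take only the entries strictly above the diagonal and sort them, which yields the index pairs directly instead of just the values. This is a sketch under that assumption: `top_pairs` is a hypothetical helper and the toy matrix is invented. Unlike the cell above, it does not drop off-diagonal pairs whose similarity happens to be 1 (for example duplicated abstracts).

```python
import numpy as np

def top_pairs(sim, n=10):
    """Return the n most similar (i, j, score) triples with i < j."""
    i, j = np.triu_indices(sim.shape[0], k=1)   # strictly above the diagonal
    scores = sim[i, j]
    order = np.argsort(scores)[::-1][:n]        # indices of the largest scores
    return [(int(i[k]), int(j[k]), float(scores[k])) for k in order]

toy_sim = np.array([[1.0, 0.2, 0.9],
                    [0.2, 1.0, 0.5],
                    [0.9, 0.5, 1.0]])
print(top_pairs(toy_sim, n=2))
```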

In [246]:
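# index pairs (upper triangle only) whose similarity is one of the top-ten values
ix = np.where(np.isin(np.triu(similiarity_matrix), top_ten))
for ix1,ix2 in zip(list(ix[0]), list(ix[1])):
    # get_presentation and presentation_ids are defined outside this excerpt
    pres1 = get_presentation(presentation_ids[ix1])
    pres2 = get_presentation(presentation_ids[ix2])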

Constraining the atmosphere of exoplanet WASP-34b
Analysis of Secondary Eclipse Observations of Hot-Jupiters WASP-26b and CoRoT-1b

Constraining the atmosphere of exoplanet WASP-34b
Atmospheric, Orbital and Secondary Eclipse Analysis of HAT-P-30-WASP-51b

Constraining the atmosphere of exoplanet WASP-34b
Secondary Eclipse Observations and Orbital Analysis of WASP-32b

Analysis of Secondary Eclipse Observations of Hot-Jupiters WASP-26b and CoRoT-1b
Atmospheric, Orbital and Secondary Eclipse Analysis of HAT-P-30-WASP-51b

Analysis of Secondary Eclipse Observations of Hot-Jupiters WASP-26b and CoRoT-1b
Secondary Eclipse Observations and Orbital Analysis of WASP-32b

Atmospheric, Orbital and Secondary Eclipse Analysis of HAT-P-30-WASP-51b
Secondary Eclipse Observations and Orbital Analysis of WASP-32b

How Giant Planets Shape the Characteristics of Terrestrial Planets
The Fragility of the Terrestrial Planets During a Giant Planet Instability

Galaxy Structure as a Driver of the Star Formation Sequence Slope and Scatter
Star formation histories of z~2 galaxies and their intrinsic characteristics on the SFR-M* plane

The Legacy of NASA Astrophysics E/PO: Conducting Professional Development, Developing Key Themes & Resources, and Broadening E/PO Audiences
The Legacy of NASA Astrophysics E/PO: Scientist Engagement and Higher Education

The Undergraduate ALFALFA Team: Collaborative Research Projects
The Undergraduate ALFALFA Team: A Model for Involving Undergraduates in Large Astronomy Collaborations

Those seem pretty similar! Looks like the code is working...

Now we'll predict which simultaneous sessions have the most overlap

For now, we'll start with the first day of conference talks, 5 Jan. We'll also only check for sessions that have the same start time (of course, we should really be looking at any overlapping sessions, but this is fine as a first pass...).
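
The more complete check mentioned above, for any two sessions that overlap in time rather than only those with identical start times, comes down to a standard interval test. Here is a minimal sketch; `sessions_overlap` is a hypothetical helper, and the times are invented for illustration rather than taken from the schedule.

```python
from datetime import datetime

def sessions_overlap(start1, end1, start2, end2):
    """True if the two half-open intervals [start, end) share any time."""
    return start1 < end2 and start2 < end1

a = (datetime(2016, 1, 5, 10, 0), datetime(2016, 1, 5, 11, 30))
b = (datetime(2016, 1, 5, 11, 0), datetime(2016, 1, 5, 12, 0))
c = (datetime(2016, 1, 5, 11, 30), datetime(2016, 1, 5, 12, 30))

print(sessions_overlap(*a, *b))  # True: b starts before a ends
print(sessions_overlap(*a, *c))  # False: c starts exactly when a ends
```

Treating the intervals as half-open means back-to-back sessions do not count as a clash.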

In [238]:
def session_similarity(so_id1, so_id2):
    """
    Compute the similarity between two sessions by getting the sub-matrix of the 
    similarity matrix for all pairs of presentations from each session.
    """
    presentations_session1 = presentation_df[presentation_df['so_id'] == so_id1]
    presentations_session2 = presentation_df[presentation_df['so_id'] == so_id2]
    if len(presentations_session1) == 0 or len(presentations_session2) == 0:
        # no presentations in session
        return np.array([])
    index_pairs = cartesian((presentations_session1.index,presentations_session2.index)).T
    sub_matrix = similiarity_matrix[(index_pairs[0],index_pairs[1])]
    shape = (len(presentations_session1), len(presentations_session2))
    sub_matrix = sub_matrix.reshape(shape)
    return sub_matrix
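
As an aside, the same kind of sub-matrix can be selected with NumPy's `np.ix_`, which avoids building the index pairs and reshaping by hand. This is a sketch on a toy matrix with arbitrary values; `sim`, `rows`, `cols` and `block` are names made up for the example.

```python
import numpy as np

# Toy 4x4 similarity matrix; the values are arbitrary
sim = np.arange(16, dtype=float).reshape(4, 4)

rows = [0, 2]      # positions of the presentations in session 1
cols = [1, 3]      # positions of the presentations in session 2

# np.ix_ builds an open mesh so that sim[...] selects every (row, col) pair
block = sim[np.ix_(rows, cols)]
print(block)
```

The result has one row per presentation in the first session and one column per presentation in the second, matching the shape returned by `session_similarity`.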

In [262]:
for name,group in session_df[session_df['start_date_time'] >= datetime(2016, 1, 5)].groupby('start_date_time'):
    for title1,so_id1 in zip(group['title'],group['so_id']):
        for title2,so_id2 in zip(group['title'],group['so_id']):
            if so_id1 >= so_id2: continue
            scores = session_similarity(so_id1, so_id2)
            if len(scores) == 0: # no presentations in one of the sessions
                continue
            if scores.max() > 0.5: # totally arbitrary threshold
                print(title1)
                print(title2)
                print(scores.max(), np.median(scores))
                print()

Intergalactic Medium, QSO Absorption Line Systems
Gas and Dust Content in Distant Galaxies
0.570158254448 0.223779436451

SDSS-IV MaNGA: Mapping Nearby Galaxies at Apache Point Observatory
The REsolved Spectroscopy Of a Local VolumE (RESOLVE) Survey and its Environmental COntext (ECO)
0.522251721857 0.245616652748

Extrasolar Planets: Hosts, Interactions, Formation, and Interiors
Formation and Evolution of Stars and Stellar Systems
0.500320769881 0.103147252391

Physical Properties of High Redshift Galaxies
Structure and Physics of Galaxies at z<~0.2
0.596678120671 0.300710183535

These are sessions scheduled in the same time slot where a talk in one session has an abstract that overlaps strongly (cosine similarity above 0.5) with a talk in the other.
